Canopy Clustering Algorithm

The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000. It is often used as a preprocessing step for the K-means algorithm or the hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set.


Description

The algorithm proceeds as follows, using two thresholds T_1 (the loose distance) and T_2 (the tight distance), where T_1 > T_2. [McCallum, A.; Nigam, K.; Ungar, L.H. (2000). "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching". Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 169–178.]
1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new 'canopy' containing this point.
3. For each point left in the set, assign it to the new canopy if its distance to the first point of the canopy is less than the loose distance T_1.
4. If the distance of the point is additionally less than the tight distance T_2, remove it from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
6. These relatively cheaply clustered canopies can then be sub-clustered using a more expensive but more accurate algorithm.

Note that individual data points may be part of several canopies. As an additional speed-up, an approximate and fast distance metric can be used in step 3, while a more accurate but slower distance metric can be used in step 4.
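The steps above can be expressed as a minimal Python sketch. The function name, parameter names, and the choice of a user-supplied distance function are illustrative assumptions, not part of the original description:

```python
def canopy_clustering(points, t1, t2, distance):
    """Group points into possibly overlapping canopies.

    t1: loose distance threshold, t2: tight threshold, with t1 > t2.
    distance: a callable giving the distance between two points.
    """
    assert t1 > t2, "the loose threshold must exceed the tight one"
    remaining = list(points)          # step 1: the set of points to cluster
    canopies = []
    while remaining:                  # step 5: loop until the set is empty
        center = remaining.pop(0)     # step 2: start a new canopy
        canopy = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:                # step 3: within the loose distance
                canopy.append(p)
            if d >= t2:               # step 4: only tightly bound points
                still_remaining.append(p)  # leave the set; others may
        remaining = still_remaining        # join later canopies too
        canopies.append(canopy)
    return canopies
```

Because a point with distance between T_2 and T_1 joins the canopy but stays in the set, it can appear in several canopies, as the description notes. For example, with 1-D points `[0.0, 2.5, 5.0]`, `t1=3.0`, `t2=1.0` and absolute difference as the distance, the point `2.5` lands in two canopies.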


Applicability

Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the curse of dimensionality. Only when a cheap and approximate (low-dimensional) distance function is available will the produced canopies preserve the clusters produced by K-means.

Its benefits include:
* The number of instances of training data that must be compared at each step is reduced.
* There is some evidence that the resulting clusters are improved. [Mahout description of Canopy-Clustering. Retrieved 2022-07-02.]
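The reduction in comparisons can be illustrated with a back-of-the-envelope sketch. The figures below assume a hypothetical best case of 10 equally sized, non-overlapping canopies over 1000 points, and that the expensive algorithm only compares points sharing a canopy; real canopies overlap, so the actual saving is smaller:

```python
def pairwise_comparisons(canopies):
    # Expensive distance computations needed when points are
    # only compared with other points in the same canopy.
    return sum(len(c) * (len(c) - 1) // 2 for c in canopies)

n = 1000
full = n * (n - 1) // 2                               # all-pairs: 499500
canopied = pairwise_comparisons([[None] * 100] * 10)  # 10 canopies: 49500
```

Under these assumptions the expensive metric is evaluated roughly a tenth as often, which is the speed-up canopy clustering is designed to provide.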

